
Add support for advanced.config #221


Merged: 7 commits merged into master from the advancedCon branch on Jul 29, 2020

Conversation

@ChunyiLyu (Contributor) commented on Jul 22, 2020:

This closes #219

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

  • Configured through `spec.rabbitmq.advancedConfig` as a string (see the example sketch after this list).
  • Updating advancedConfig will trigger a StatefulSet rolling restart because, like `rabbitmq.conf`, the advanced config is loaded at node start.
  • Before restarting the StatefulSet, the controller runs rabbitmqctl eval <advanced.config> to validate that the input has correct syntax. rabbitmqctl eval evaluates the input, so it is more than a minimal syntax validation. Some configurations in advanced.config can be loaded while RabbitMQ is running, and some can only be loaded at node start. I looked into erl -noshell -eval and writing an escript, but could not figure out how to use them successfully, so I settled on using rabbitmqctl eval.
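
For illustration (not part of the original PR description), here is a minimal sketch of how the new field might be set on a RabbitmqCluster resource. Only `spec.rabbitmq.advancedConfig` itself is introduced by this PR; the apiVersion, the resource name, and the Erlang terms inside the string are assumptions for the example.

  apiVersion: rabbitmq.com/v1beta1      # assumed API version for the example
  kind: RabbitmqCluster
  metadata:
    name: example-rabbitmq              # illustrative name
  spec:
    rabbitmq:
      advancedConfig: |
        [
          {rabbit, [
            {auth_backends, [rabbit_auth_backend_ldap]}
          ]}
        ].

Because the string ends up in advanced.config, it has to be a valid Erlang term list terminated with a period, as in the hypothetical LDAP example above.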

Additional Context

Local Testing

I have run unit, integration, and system tests.

This change is covered at the unit and system test levels.

- related to issue #219
- configured through `spec.rabbitmq.advancedConfig` as a string
- updating advancedConfig will trigger a sts rolling restart bc similar to `rabbitmq.conf`, advanced config is loaded at node start
@@ -4476,6 +4472,10 @@ spec:
type: string
maxItems: 100
type: array
advancedConfig:
description: Specify any rabbitmq advanced.config configurations
maxLength: 2000
Contributor:

Is this limit required? 2000 characters can be exceeded in environments with extensive LDAP queries and OAuth 2 settings, for example.

@ChunyiLyu (author) replied:

It's not required. There is a hard limit in Kubernetes because all objects are stored in etcd, and etcd has a limit of 1MB per object, which translates to 1048576 characters. I will update the limit to 100000. Thanks for pointing it out, Michael!
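
For reference, a sketch of what the CRD property might look like after raising the limit; the indentation and the type line are reconstructed here rather than quoted from the diff.

  advancedConfig:
    description: Specify any rabbitmq advanced.config configurations
    maxLength: 100000
    type: string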

@Zerpet (Member) previously approved these changes on Jul 23, 2020 and left a comment:

I left a cosmetic suggestion, nothing that stops the ✅ 👌

Comment on lines 146 to 147
advancedConfig, ok := configMap.Data["advanced.config"]
Expect(ok).To(BeTrue())
@Zerpet (Member):

Cosmetic change request 🙃 Let's annotate the assertion so that if this ever fails, we get a message more descriptive than "expected false to be true"

@ChunyiLyu (author) replied:

Updated in the new commit ;) good call, Aitor.

@ChunyiLyu (author):

Working on adding syntax validation before updating advanced.config and restarting the StatefulSet.

ChunyiLyu and others added 2 commits July 27, 2020 20:28
* Add client service override `spec.override.clientService`

- this closes #142
- add an additional port example; using the client service override
- make sts and svc override's name and namespace not configurable
Pinning to a specific SHA no longer makes sense with no OSL process.
@Zerpet (Member) left a review comment:

In the "failure" scenario where an invalid advanced config is provided, I don't think we are loud enough to surface that there is a problem. I think that updating the status condition and logging an event is not enough, because the user has to decide to take the additional step of describing the RabbitmqCluster resource.

In the current implementation, I can't think of any motivation for the user to take a diagnosing step, given that all pods are running and "everything just works". It'll be at a later time, when RabbitMQ is not doing what's expected (due to advanced config not being loaded), that a user would suspect there is something not right. And even at this point, I don't think that describing the custom resource is a step one would think of taking as a first step.

I think this safeguard is protecting availability in a scenario where availability is not expected, i.e. a single RabbitMQ node. In a single-node scenario, a user was expecting to lose availability due to the node restart to apply the advanced config. I don't think that preventing the restart (when it was expected) and never reconciling again (due to #225) is a better alternative to letting RMQ crash due to invalid configuration.

In the scenario of a 3-node cluster, the StatefulSet controller will preserve the availability of the cluster because it won't roll the surviving 2 replicas until the first one is ready, which will never happen due to the invalid advanced config.

@Zerpet dismissed their stale review on July 28, 2020 13:07

The latest changes have modified the behaviour of the Operator and the original review is no longer valid.

@ChunyiLyu (author):

@Zerpet

You are totally right that the implementation here is far from ideal. This is part of the reason why I think it is important for us to figure out a more reliable way to load updated configurations, and why I created issue #225.

In my opinion, it seems very easy for anyone to make syntax mistakes with their advanced config, and a rabbitmq-server node crash is something we would like to avoid. It's true that users expect downtime when they update configurations because of the restart. However, a node crashing not only causes downtime but also panic, and it sounds like unnecessary pain for our users. I think that, given that users expect a restart of the StatefulSet when they update configurations, and they are most likely manually performing this action, the fact that the StatefulSet stays running and is not restarted is a signal to them that something has gone wrong. status.condition is also the recommended way to monitor k8s objects, and users should be actively monitoring that.

This all depends on how much we want to prevent a server crash. If a server crash is something users are totally ok with and are used to handling, removing the validation and letting it happen makes sense to me. However, if it is a failure mode that causes pain for users, I do not see why we wouldn't add an additional validation to avoid arriving there.

I would like to hear others' opinions on this subject as well. Thoughts? @mkuratczyk @j4mcs

@Zerpet (Member) commented on Jul 28, 2020:

> However, a node crashing not only causes downtime but also panic, and it sounds like unnecessary pain for our users. I think that, given that users expect a restart of the StatefulSet when they update configurations, and they are most likely manually performing this action, the fact that the StatefulSet stays running and is not restarted is a signal to them that something has gone wrong.

I think that a node crash, when a user was expecting downtime, does not cause panic or extra pain. The user was expecting downtime anyway. In my opinion, a Pod that enters a crash loop backoff is a more evident, direct, right-in-your-face symptom that there is a problem. A Pod that is just running, when you expected it to roll, doesn't seem so evident to me that there is a problem. Maybe the change did not require a Pod restart? Plugins do not restart Pods, so maybe this change doesn't either? Maybe the update has not been scheduled yet? It's not evident unless the user takes further actions.

> status.condition is also the recommended way to monitor k8s objects, and users should be actively monitoring that.

I've been wondering how true this statement is in the world outside. For us, it is very clear. I have not yet heard any mention of conditions, or of the CR status, in any feedback communication we have had so far. I've been wondering for a while whether status conditions aren't just something we convinced ourselves "is the way". Even when I'm troubleshooting a problem during development, I rarely check status conditions or status objects unless I have a good reason to; instead I find myself doing kubectl get all and kubectl logs. I'm not convinced that "you should monitor your status conditions" is a widely adopted or understood practice.

> This all depends on how much we want to prevent a server crash.

Phrasing my argument another way: what does a server crash cause? Downtime. A single RabbitMQ node was expecting downtime, so we are not avoiding an unexpected event. In a 3-node cluster, what does one Pod in crash loop backoff cause? Degraded availability, no downtime. Service continuity remains intact.

I definitely think that a Pod in crash loop backoff is a much better option than silently masking the error in a status condition.

@mkuratczyk (Collaborator):

I see good points on both sides and haven't made up my mind yet. However, aren't validation webhooks the right tool for this job? If we just rejected changes to the CR when advancedConfig is invalid, we would prevent downtime and be explicit about the problem.
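
Not something this PR implements, but to make the webhook suggestion concrete: a validating admission webhook could reject a RabbitmqCluster whose advancedConfig fails a syntax check before the object is persisted, so no restart is ever attempted. Below is a sketch of a plain Kubernetes ValidatingWebhookConfiguration; every name, namespace, path, and version in it is an assumption for illustration, and only the rabbitmq.com group and the rabbitmqclusters resource come from this project.

  apiVersion: admissionregistration.k8s.io/v1
  kind: ValidatingWebhookConfiguration
  metadata:
    name: rabbitmqcluster-validation            # illustrative name
  webhooks:
    - name: vrabbitmqcluster.example.com        # illustrative name
      admissionReviewVersions: ["v1"]
      sideEffects: None
      failurePolicy: Fail                       # reject the object if the webhook cannot decide
      clientConfig:
        service:
          name: operator-webhook-service        # illustrative; the operator's webhook server
          namespace: rabbitmq-system            # illustrative namespace
          path: /validate-rabbitmq-com-v1beta1-rabbitmqcluster
      rules:
        - apiGroups: ["rabbitmq.com"]
          apiVersions: ["v1beta1"]              # assumed CRD version
          operations: ["CREATE", "UPDATE"]
          resources: ["rabbitmqclusters"]

The webhook server behind that service would perform the same kind of syntax check discussed above (for example via rabbitmqctl eval or an Erlang term parser) and return an admission denial with a descriptive message, which is what makes the failure explicit to the user.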

@ChunyiLyu (author) commented on Jul 29, 2020:

@Zerpet @mkuratczyk validation webhooks would be a more ideal place for this. I've deleted the validation logic.

The branch history looks a bit messy at the moment. I will squash merge.

@j4mcs (Contributor) commented on Jul 29, 2020:

I'm a bit wary of spending too much time in K8s concerned with a RabbitMQ fault. Ultimately if the user misconfigures their RabbitMQ cluster, the failure mode will be determined by RabbitMQ. I feel like we're trying to patch behaviour of the underlying technology with reconciliation logic. If an invalid advanced config should result in a spectacular failure of the RabbitMQ cluster then that's the core's responsibility.

I agree that we've made a conscious opinionated decision that conditions are where we surface underlying faults. But I don't see an issue with that. Ultimately, we've taken a cue from Knative and it's definitely not an established standard.

I worry that unless we let misconfigurations like these run their course, we're going to have more and more edge cases bloating reconciliation.

All that to say I'm happy with the code as it is. I'd be up for exploring a validating webhook, but I think it needs to be opt-in, not least because it will mean loading Erlang to do the syntax checking.

@Zerpet (Member) commented on Jul 29, 2020:

I agree and resonate with James' comment. I'm happy with the current state and in favour of exploring validating webhooks.

@ChunyiLyu merged commit 7a6946b into master on Jul 29, 2020
@ChunyiLyu deleted the advancedCon branch on July 29, 2020 11:27
@ChunyiLyu (author):

In my opinion, the benefit our operator can provide is improving the experience of operating core RabbitMQ on k8s. The operator should make deploying and running RabbitMQ on k8s easier for users than rolling their own solution. If we were to say that experience improvements do not belong in the operator, then how are we offering anything different from the templating that the Bitnami chart has already been offering for two years? I don't know the answer to this question, nor do I have a clean definition of what counts as a reasonable experience improvement. However, I disagree with the statement that we should avoid patching RabbitMQ behavior with our operator.

The PR is merged. I'm fine with the proposed next step. We can chat about the responsibility boundary of the operator some other time. This will definitely be a topic that comes up repeatedly.

ChunyiLyu added a commit that referenced this pull request Aug 25, 2020
* Add support for advanced.config

- related to issue #219 
- configured through `spec.rabbitmq.advancedConfig` as a string
- updating advancedConfig will trigger a sts rolling restart bc similar to `rabbitmq.conf`, advanced config is loaded at node start
- set maxLength for advancedConfig to 100000